Nagpure et al. BMC Genomics 2013, 14:630 
http://www.biomedcentral.com/1471-2164/14/630



DATABASE Open Access 



FishMicrosat: a microsatellite database of 
commercially important fishes and shellfishes of 
the Indian subcontinent 

Naresh Sahebrao Nagpure 1, Iliyas Rashid 1, Rameshwar Pati 1, Ajey Kumar Pathak 2, Mahender Singh 1*,
Shri Prakash Singh 2 and Uttam Kumar Sarkar 2 



Abstract 

Background: Microsatellite DNA is one of many powerful genetic markers used for the construction of genetic
linkage maps and the study of population genetics. The biological databases in the public domain hold vast numbers
of microsatellite sequences for many organisms, including fishes. The microsatellite data available in these data
sources were extracted and organized into a database that facilitates sequence analysis and browsing of relevant
information. The system also helps to design primer sequences for the flanking regions of repeat loci for PCR
identification of polymorphism within populations.

Description: FishMicrosat is a database of microsatellite sequences of fishes and shellfishes that includes important 
aquaculture species such as Lates calcarifer, Ctenopharyngodon idella, Hypophthalmichthys molitrix, Penaeus 
monodon, Labeo rohita, Oreochromis niloticus, Fenneropenaeus indicus and Macrobrachium rosenbergii. The database 
contains 4398 microsatellite sequences of 41 species belonging to 15 families from the Indian subcontinent. 
GenBank of NCBI was used as a prime data source for developing the database. The database presents information 
about simple and compound microsatellites, their clusters and locus orientation within sequences. The database
has been integrated in a web interface with tools for primer design, locus finding, repeat mapping,
detection of similarities among sequences across species, and searching by motifs and keywords. In
addition, information on the top 10 families and the top 10 species can be browsed through the
record overview.

Conclusions: The FishMicrosat database is a useful resource for fish and shellfish microsatellite analyses and locus
identification across species, which has important applications in studies of population genetics, evolution and
genetic relatedness among species. The database can be expanded further to include microsatellite data of
fishes and shellfishes from other regions and available information from genome sequencing projects of species of
aquaculture importance.

Background

Microsatellites are observed in almost all known eukaryotic and
prokaryotic genomes, in both coding and non-coding regions. They
have a high mutation rate (between 10⁻³ and 10⁻⁴ mutations per
gamete per generation) that generates and maintains extensive
length polymorphism [1,2]. This makes microsatellites powerful
genetic markers for a variety of applications such as population
genetics, genetic linkage mapping, parentage assignment,
marker-assisted selection, molecular breeding, and allele mining
[3,4]. A microsatellite locus generally varies in length between
5 and 40 repeats. Di-, tri- and tetranucleotide repeats are the
most common choices for molecular genetic studies. Dinucleotides
are an abundant type of microsatellite repeat found in most
vertebrates, whereas trinucleotide repeats are most abundant in
plants [5-7]. Microsatellites represent ideal molecular markers
because they have multiple alleles that are highly polymorphic
among individuals and loci that are highly abundant and dispersed
evenly throughout eukaryotic genomes. The major drawback of using
microsatellites is that for most species they need to be
developed de novo, a process that is often costly and protracted
[8]. Efforts have been made worldwide to compile and develop
online and offline microsatellite databases of biological
organisms [9-15]. Valuable studies have been done in fishes, such
as the construction of microsatellite genetic linkage maps
[16-20], characterization and identification of microsatellites
[21-24] and cross-species microsatellite locus identification
[25-27]. Despite the importance of microsatellite markers, few
efforts have been made to develop microsatellite databases for
fishes, apart from those for Danio rerio [28] and Cyprinus carpio
[29] and Fishgen [30].

In this article, we describe the development of a microsatellite
database (FishMicrosat) for population genetics and stock
management using LAMPP (Linux-Apache-MySQL-PHP-Perl) technology,
with GenBank of NCBI as the data source from which the
microsatellite data were extracted. FishMicrosat is a unique
database of microsatellite sequences that covers commercially
important fish and shellfish species of the Indian subcontinent.
The database currently contains 4398 sequences of 41 species
belonging to 15 families and provides information on the type of
repeat (mono-, di-, tri-, tetra-, penta- and hexanucleotide;
simple and compound microsatellite), along with the
characteristics of repeats, namely size, region, pattern and
unit. Additionally, algorithms were implemented for finding loci
across species, based on the presence of identical simple
sequence repeats (SSRs) with the same or varying frequencies of
repeat units but conserved flanking regions. The database is
regularly updated based on the release of new records in GenBank
for the existing 41 species as well as the addition of new
species belonging to the Indian subcontinent. It is expected that
the database will be a valuable resource in many aspects of fish
genetic research of the Indo-Pacific region, Bay of Bengal and
Arabian Sea.

Construction and content 

Data source 

Microsatellite sequences of fish and shellfish species were
downloaded from Entrez of NCBI [31] using the keyword search
'Fish microsatellite' under nucleotide. Files were downloaded in
GenBank format for annotation and in FASTA format for sequences.
A Perl program (SpciesExtractor.pl) was then written and used to
extract, from the downloaded files, data for only the important
species found in the Indian subcontinent. Other information about
the species, such as habitat, distribution and IUCN Red List
status, was collected from FishBase [32]. Another Perl parsing
program (InformationParser.pl) was developed to extract the
information from the files according to the database schema and
load the data into the database. These Perl programs are used by
the database administrator to update the database when new
microsatellite sequences are released for existing and new
species.

Design and development 
Database 

To manage the data, the relational database management system
MySQL was used to build the database. Tables were designed and
relationships among tables were created using unique, primary and
foreign keys. Five tables were designed to store the information
about microsatellite sequences and species. The table 'fishinfo'
contains the physical and phenotypic information;
'satellite_sources' holds detailed molecular information about
microsatellites; 'satellite' works as a bridge between the tables
'fishinfo' and 'satellite_sources'; and 'taxonomy' holds
systematic information on the species and acts as a sub-table of
'fishinfo'. Finally, the table 'repeats' holds the data about
repeats of all microsatellite sequences, obtained using the
repeat analysis program 'MISA' [33], as shown in Figure 1.
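
The five-table layout described above can be sketched as follows. This is only an illustration under stated assumptions: the table names come from the text, but every column name, the sample row and the use of an in-memory SQLite database in place of MySQL are hypothetical choices made to keep the example self-contained.

```python
import sqlite3

# Only the five table names come from the text; all columns are hypothetical.
SCHEMA = """
CREATE TABLE fishinfo (
    species_id   INTEGER PRIMARY KEY,
    species_name TEXT UNIQUE NOT NULL,
    common_name  TEXT,
    habitat      TEXT
);
CREATE TABLE taxonomy (            -- sub-table of fishinfo
    species_id INTEGER PRIMARY KEY REFERENCES fishinfo(species_id),
    family     TEXT NOT NULL
);
CREATE TABLE satellite_sources (   -- molecular details of each record
    accession TEXT PRIMARY KEY,
    sequence  TEXT NOT NULL,
    authors   TEXT
);
CREATE TABLE satellite (           -- bridge: species <-> sequence record
    species_id INTEGER NOT NULL REFERENCES fishinfo(species_id),
    accession  TEXT NOT NULL UNIQUE REFERENCES satellite_sources(accession),
    PRIMARY KEY (species_id, accession)
);
CREATE TABLE repeats (             -- one row per repeat reported by MISA
    repeat_id INTEGER PRIMARY KEY,
    accession TEXT NOT NULL REFERENCES satellite_sources(accession),
    motif     TEXT NOT NULL,
    units     INTEGER NOT NULL,
    start_pos INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create the sketch schema in memory and load one made-up record."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO fishinfo VALUES (1, 'Labeo rohita', 'rohu', 'freshwater')")
    conn.execute("INSERT INTO taxonomy VALUES (1, 'Cyprinidae')")
    # 'ACC0001' and its sequence are placeholders, not GenBank data.
    conn.execute("INSERT INTO satellite_sources VALUES "
                 "('ACC0001', 'ACGTCACACACACACATTGA', 'demo')")
    conn.execute("INSERT INTO satellite VALUES (1, 'ACC0001')")
    conn.execute("INSERT INTO repeats VALUES (1, 'ACC0001', 'CA', 6, 5)")
    conn.commit()
    return conn


def repeats_for_species(conn, species_name):
    """Follow fishinfo -> satellite -> repeats for one species."""
    return conn.execute(
        """SELECT r.accession, r.motif, r.units
           FROM fishinfo f
           JOIN satellite s ON s.species_id = f.species_id
           JOIN repeats r ON r.accession = s.accession
           WHERE f.species_name = ?""",
        (species_name,),
    ).fetchall()
```

Routing the query through the bridge table keeps species attributes and sequence attributes in separate tables, which matches the role the text assigns to 'satellite'.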

Web interface 

A web interface integrated with the database was designed and
developed to retrieve and access the information of interest
using web technologies such as PHP, HTML, CSS, JavaScript, DBI
(Database Interface), CGI (Common Gateway Interface), GD
(graphics library) and Perl. The web interface also incorporates
different tools for searching, viewing and analysing the
microsatellite data (Figure 1).

Figure 1 Architecture and data flow representation in 'FishMicrosat'.



Identification of microsatellite loci across species 

The microsatellite loci among the existing sequences were
identified by implementing an algorithm in a Perl program
(locusfinder.pl). To construct the algorithm, a microsatellite
sequence of the selected species was divided into three parts:
(a) the motif of the repeat region, (b) a 25 bp flanking sequence
upstream and (c) a 25 bp flanking sequence downstream of the
repeat region. The repeat regions and motifs present in sequences
of the selected species were fetched sequentially from the
'repeats' table of the database to retrieve identical target
motifs and their sequences. The conserved flanking regions were
then checked in the query as well as the target sequences. The
evolutionary conservation of the flanking region allows
heterospecific identification of SSRs [34]. These conserved
flanking regions have been used for designing PCR primers for
microsatellite amplification and genotyping of individuals of the
same species as well as across species [35,36]. Thus, to identify
loci across species, the algorithm takes, for example, 'ABC' as
the repeat pattern and L as the number of repeat units in a
selected query sequence. The same repeat pattern 'ABC' is then
checked for its presence and repeat frequency (denoted as P) in
the target sequence (Figure 2). Because the repeat frequency may
be polymorphic, the repeat frequency in the selected query
sequence (L) may or may not equal the repeat frequency in the
target sequence (P), i.e. L = P or L ≠ P. The algorithm uses a
25 bp flanking region on either side, which is sufficient for
amplification of a microsatellite locus in a PCR reaction for
laboratory validation. The loci identification program supports
the findings of previous studies that microsatellite repeats vary
within and between different genomes of organisms [37,38].
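
The comparison described above can be sketched in a few lines. This is not the authors' locusfinder.pl: the function names, the minimum of 5 repeat units, the ungapped position-by-position identity measure and the 0.8 identity threshold are all assumptions made for illustration, since the text does not state how flank conservation is scored.

```python
import re
from dataclasses import dataclass


@dataclass
class Locus:
    motif: str
    units: int        # repeat units: L for a query, P for a target
    upstream: str     # flanking sequence before the repeat region
    downstream: str   # flanking sequence after the repeat region


def extract_locus(sequence, motif, flank=25, min_units=5):
    """Return the first run of `motif` with both flanks, or None."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_units))
    m = pattern.search(sequence)
    if m is None:
        return None
    start, end = m.span()
    if start < flank or len(sequence) - end < flank:
        return None  # not enough flanking sequence on one side
    return Locus(motif, (end - start) // len(motif),
                 sequence[start - flank:start], sequence[end:end + flank])


def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def same_locus(query, target, min_identity=0.8):
    """Same motif and conserved flanks; the unit counts L and P may differ."""
    return (query.motif == target.motif
            and identity(query.upstream, target.upstream) >= min_identity
            and identity(query.downstream, target.downstream) >= min_identity)
```

Note that `same_locus` deliberately ignores `units`, which reflects the point that L and P need not be equal for two sequences to represent the same locus.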

Search and analysis 

Apart from locus finding across species, other search and
analysis modules such as 'Keyword search', 'Repeat analysis and
primer', 'Motifs search' and 'Repeats map' were implemented and
integrated in the web interface for browsing information. The
'Keyword search' takes a word such as 'fish name', 'common name',
'family', 'accession number' or 'author' typed in by the user as
input and performs the search. The 'Motif search' takes input
parameters such as motif, repeat length and repeat type (simple
or composite) and returns the result with the help of regular
expressions and SQL (Structured Query Language) queries. The
'Repeats map' was developed using Perl, MISA and Blastn [39] for
identifying similarities among the sequences and mapping the
repeats. The Blastn program uses the 'blastmsdb' database, which
is BLAST-compatible and was created separately from the main
database using the 'formatdb' program of the BLAST package.
'Repeats map' analyzes and processes the input query sequence
through the MISA program to detect repeats. If repeats are found,
the sequence is aligned with other similar sequences; otherwise
the program terminates with a warning message. The Primer3
program [40,41] was used for primer design, and a standalone
version was downloaded [42] to compute multiple sets of forward
and reverse primers for microsatellite loci along with melting
temperature (Tm), GC content, start position and product size.
The primers generated can be used in PCR reactions to identify
polymorphic loci for genotyping of individuals.

Figure 2 Flowchart of the locus finder algorithm.

Implementation of 'statistics' 

The MISA program was used to ascertain the frequently occurring
repeat types and repeat information from all the sequences in
FishMicrosat. The results obtained from MISA were parsed and
stored in the 'repeats' table of the database. The 'GD graph'
module was used to design and dynamically display the frequency
of the different types of repeats (mono- to hexanucleotide) in a
pie diagram. The pie diagram presents the frequency of each type
of repeat and is revised when the database is updated.
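
The kind of tally that feeds such a pie diagram can be sketched as below. This is a simplified stand-in for MISA, not MISA itself: it finds only perfect repeats with a regular expression, and the minimum unit counts per repeat size are illustrative assumptions rather than the settings used for FishMicrosat.

```python
import re
from collections import Counter

# Assumed minimum number of units per unit size (mono to hexa).
MIN_UNITS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
NAMES = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def is_primitive(unit):
    """False if `unit` is itself a repeat of a shorter unit, e.g. 'AA' or 'ACAC'."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k)
                   for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (1-based start, unit, number of units) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_units in MIN_UNITS.items():
        # A unit of `size` bases followed by at least min_units - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_units - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):  # do not report poly-A as an 'AA' repeat
                hits.append((m.start() + 1, unit, len(m.group(0)) // size))
    return sorted(hits)


def repeat_type_counts(sequences):
    """Count repeats per type (mono to hexa) across many sequences."""
    counts = Counter()
    for seq in sequences:
        for _, unit, _ in find_ssrs(seq):
            counts[NAMES[len(unit)]] += 1
    return counts
```

Compound repeats and imperfect repeats are not handled here; a real pipeline would rely on the output of MISA for those.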

Discussion and utility 

FishMicrosat covers 4398 microsatellite records of 41
commercially important aquaculture species belonging to 15
families (Table 1). The home page of the FishMicrosat web
interface, which is integrated with different analytical modules,
presents numerical statistics of the top 10 families and species,
in addition to information on updates and the current status of
the database. The 'Top 10 FishMicrosat families' displays the ten
families that have the largest number of species in FishMicrosat,
and the 'Top 10 FishMicrosat Species' displays the ten species
for which the largest number of specimen records are available in
the database. The page also provides an overview of FishMicrosat
and its features. Analytical tools such as motif search, sequence
similarity search, repeat mapping and finding microsatellite loci
across species were integrated to increase the utility and scope
of the database.

Browsing specimen information 

The specimen records of the species of interest can be viewed by
using the species instantiation index under the 'Record overview'
menu item in the web interface. Hyperlinked navigational indexes
by the first letter of the generic name are provided to find the
species of interest, with the number of specimen records shown in
square brackets. Each species name is hyperlinked, and clicking
on it presents information on family, common name, habitat,
distribution, microsatellite repeats, their region and size,
sequence length, authors and NCBI accession number. The NCBI
accession number for each specimen record is also hyperlinked to
NCBI. The 'Top 10 FishMicrosat families' and 'Top 10 FishMicrosat
Species' on the home page of the web interface provide another
means of viewing information about the species and their
specimens.

Keyword search 

The keyword search works on keywords such as species name, common
name, family name, accession number and author to retrieve
information from the database. Different views have been created
for these keywords to present relevant information from the
database. For example, a species name or common name as the input
keyword leads to the record overview. The 'author view' displays
a list of all the species on which a particular author has
worked, along with the specimen records that correspond to the
listed species. Similarly, family name and accession number
keywords lead to their respective views (Figure 3).

Repeat analysis and primer design 

The menu item 'Analysis & primer' (Figure 4A) detects repeats in
the sequences and designs primers for the selected repeat locus.
To obtain the repeat information and design primers for a
specific repeat, the end user selects a species of interest,
indexed by generic name. Clicking on the species name provides a
table that contains information such as accession no., SSR no.,
SSR type, SSR motif, SSR size, position and sequence length, and
a link for primer design for each specimen (Figure 4B). For
primer design, the standalone 'Primer3' program computes primers
upon user request for microsatellite sequences that have flanking
regions of suitable length and ample GC content in those regions;
otherwise the request is rejected with a warning message
(Figure 4C). The program displays a list of multiple primers
along with their respective values for Tm, GC content, start
position and product size (oligo size). The primer sequences will
be useful in determining alleles and finding loci across species.
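
To make the reported primer properties concrete, the sketch below computes GC content, a rough melting temperature and a product size. It is not how Primer3 works: Primer3 uses a thermodynamic nearest-neighbour model for Tm, whereas this example swaps in the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is only a rough guide for short oligonucleotides. The function names and the coordinate convention are assumptions for illustration.

```python
def gc_content(primer):
    """Percentage of G and C bases in a primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def product_size(forward_start, reverse_end):
    """Amplicon length, given the template position of the first base of the
    forward primer and of the last template base covered by the reverse
    primer (both in the same coordinate system, inclusive)."""
    return reverse_end - forward_start + 1
```

For a 20-mer with ten G/C and ten A/T bases, `gc_content` gives 50.0 and `wallace_tm` gives 60; values from Primer3 for the same oligo will generally differ because of the different model and its salt and concentration parameters.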

Motif search 

A repeat motif can be searched from the menu item 'Motif search'
integrated in the web interface (Figure 5A). It searches for
repeats in all microsatellite sequences present in the database
and fetches information on species name, family, repeats, size,
repeat region, NCBI references and primers for SSRs (Figure 5B).
Three input values are required under 'Motif search': 'Motif' for
the nucleotide pattern (mono- to hexanucleotide), 'Length' for
the number of nucleotides (i.e. > 10) and 'Repeats type' (simple
or compound). The search results provide a primer link that leads
to the design of the primer for the corresponding repeat type.
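
A regular-expression motif search of this kind can be sketched as follows. The function names, the in-memory dictionary of records and the interpretation of 'Length' as a minimum span in nucleotides are assumptions; the sketch covers simple repeats only and does not attempt the compound repeat type.

```python
import math
import re


def motif_pattern(motif, min_length):
    """Regex for a run of `motif` spanning at least `min_length` nucleotides."""
    min_units = max(1, math.ceil(min_length / len(motif)))
    return re.compile("(?:%s){%d,}" % (re.escape(motif.upper()), min_units))


def motif_search(records, motif, min_length=10):
    """Search a dict of accession -> sequence.

    Returns (accession, 1-based start, number of units) for every run found.
    """
    pattern = motif_pattern(motif, min_length)
    hits = []
    for accession, seq in records.items():
        for m in pattern.finditer(seq.upper()):
            hits.append((accession, m.start() + 1,
                         len(m.group(0)) // len(motif)))
    return hits
```

In a database-backed version, the same pattern could be applied to sequences fetched by an SQL query, or the search could be run directly against pre-computed rows of the 'repeats' table.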

Locus finder 

The Locus finder tab finds identical microsatellite loci across
species based on conserved flanking sequences (approximately
25 bp long) on either side of the polymorphic loci. The program
uses two input parameters: length of flanking region and species
name. Finding the identical microsatellite locus in other species
present in the FishMicrosat database is highly useful for
cross-species amplification of microsatellite loci. For example,
the sequences of Labeo rohita (GenBank accession no. AY291597)
and Catla catla (GenBank accession no. AJ294957) contain the same
motif with conserved flanking regions when the parameters are set
to flanking region '20 bp' and species 'Labeo rohita'. The
sequence alignments are 88% identical, indicating homology
between the sequences (Figure 5C).

Table 1 Distribution of SSRs by species

Figure 3 Search results for different keywords in 'FishMicrosat'.

Repeats map 

Repeat mapping and sequence similarity searching can be performed
through the menu item 'Repeats map' in the web interface. The
program accepts microsatellite sequences in FASTA format as input
in the provided text area. The output presents information on
repeats (size of the query sequence, presence of
compound/composite repeats, number of SSRs identified in the
query sequence, SSR number, repeat location and repeat size)
along with a summary of the alignment of identical/similar
sequences. The alignment summary presents the accession no. of
each target sequence, species name, target length, gaps, matches
and identity between the query and target sequences (Figure 5D).
The program first checks for the presence or absence of repeats
in the input sequence and assigns a boolean value. If the value
is true, the program processes the query sequence using the
Blastn program and its compatible 'blastmsdb' database to search
for similar sequences. Thus, it helps to find information about
repeat orientation and sequence similarity for newly generated
microsatellite sequences.

Repeat statistics 

In order to determine the frequency of different types of repeats
from the specimen records available in FishMicrosat, the menu
item 'Statistics' generates the frequency of each motif found in
the database and displays the three most common motifs. For
example, the statistics view shows that the repeat 'AC' was found
998 times, 'TC' 909 times and 'CA' 881 times throughout all
sequences. A repeat type index has also been included to display
all the repeats and their frequencies in a table. The
dinucleotide repeat type, selected by default, displays 12
combinations of dinucleotides. The maximum frequency of each type
of nucleotide repeat (mono- to hexanucleotide) can be viewed in
the pie diagram (Figure 5E). The most frequent mononucleotide
repeat is 'T' with 129 occurrences, the most frequent
dinucleotide repeat is 'AC' with 998 occurrences, trinucleotide
'CAT' with 48 occurrences, tetranucleotide 'ATCT' with 56
occurrences, pentanucleotide 'TTATC' with 2 occurrences and
hexanucleotide 'CACACT' with 4 occurrences. The database, with
4398 sequences of 41 species, has 277 mono-, 4207 di-, 610 tri-,
554 tetra-, 15 penta- and 11 hexanucleotide repeats and 279
compound repeats (Table 1). This section also analyzes
information on the occurrence of the most frequent and the rarest
nucleotide repeats in the fish genome. The dinucleotide repeats
AC|TG (998|909) and CA|GT (881|686) were found frequently, while
CG|GC (9|7) were rare in the fish genome.

Figure 4 The web layout for SSR analysis and primer design of 'FishMicrosat'. (A) Species-specific SSR analysis and primer design (B) repeat
analysis output (C) SSR-specific primer design output.

Conclusions 

FishMicrosat is a database of microsatellite sequences of
commercially important fishes, including shrimps, and currently
covers 4398 specimen records for 41 species. The database
facilitates mining of SSR motifs, repeat orientations and
sequence similarities. The statistics present the relative
abundance of microsatellite repeats that occur frequently in the
genomes. Additionally, it facilitates identifying polymorphic
loci across species and designing primers for repeat loci, thus
providing researchers with ready-to-use information from a
centralized location and avoiding the cumbersome process of
referring to multiple sources of literature and using multiple
programmes. This repository, with its included tools, can play a
key role in cutting-edge areas of research by assisting with
marker selection, linkage mapping, population genetics,
evolutionary studies, studies of genetic relatedness among
species and genetic improvement programmes of important
aquaculture species.

Figure 5 The different web layouts of 'FishMicrosat' for data retrieval and analyses. (A) Home page (B) Motif search (C) Locus finder
(D) Repeats map (E) Repeats statistics.

Availability and requirement 

FishMicrosat is freely accessible at http://mail.nbfgr.res.in/fishmicrosat/
for research and academic use.